Bayesian hierarchical lasso Cox model: A 9-gene prognostic signature for overall survival in gastric cancer in an Asian population

Objective Gastric cancer (GC) is one of the most common tumour diseases worldwide and has poor survival, especially in the Asian population. Exploration based on biomarkers would be efficient for better diagnosis, prediction, and targeted therapy. Methods Expression profiles were downloaded from the Gene Expression Omnibus (GEO) database. Survival-related genes were identified by gene set enrichment analysis (GSEA) and univariate Cox. Then, we applied a Bayesian hierarchical lasso Cox model for prognostic signature screening. Protein-protein interaction and Spearman analysis were performed. Kaplan–Meier and receiver operating characteristic (ROC) curve analysis were applied to evaluate the prediction performance. Multivariate Cox regression was used to identify prognostic factors, and a prognostic nomogram was constructed for clinical application. Results With the Bayesian lasso Cox model, a 9-gene signature included TNFRSF11A, NMNAT1, EIF5A, NOTCH3, TOR2A, E2F8, PSMA5, TPMT, and KIF11 was established to predict overall survival in GC. Protein-protein interaction analysis indicated that E2F8 was likely related to KIF11. Kaplan-Meier analysis showed a significant difference between the high-risk and low-risk groups (P<0.001). Multivariate analysis demonstrated that the 9-gene signature was an independent predictor (HR = 2.609, 95% CI 2.017–3.370), and the C-index of the integrative model reached 0.75. Function enrichment analysis for different risk groups revealed the most significant enrichment pathway/term, including pyrimidine metabolism and respiratory electron transport chain. Conclusion Our findings suggested that a novel prognostic model based on a 9-gene signature was developed to predict GC patients in high-risk and improve prediction performance. We hope our model could provide a reference for risk classification and clinical decision-making.

Introduction samples. Both datasets were generated by the Affymetrix Human Genome U133 Plus 2.0 platform, and GSE66229 was normalized by robust multiarray average with the Affymetrix Power Tools package. GSE15459 was handled by Microarray Suite version 5.0 using Affymetrix default analysis settings and global scaling as a normalization method. Log2 transformation was utilized in this study.

Gene Set Enrichment Analysis (GSEA)
In contrast to traditional analysis, GSEA was not limited to providing a clear threshold (e.g., log2FC) for differentially expressed genes but was focused on genes contributing to specific biological function gene sets [13]. It could avoid removing important genes with no statistically significant expression differences. In this study, samples from GSE66229 were divided into the tumour and non-tumour groups. Then, GSEA v4.1.0 was used to analyze gene data. Based on the Molecular Signatures Database, the gene sets of Kyoto Encyclopedia of Genes and Genomes (KEGG, c2.cp.kegg.v7.2.symbols.gmt) and Gene Ontology (GO, c5.go.v7.2.symbols. gmt) were set as reference datasets. Nom-P value<0.05 and false discovery rate (FDR) <0. 25 were set as the cut-off values. In addition, function enrichment analysis based on the same settings was performed for different risk groups.

Establishment of the prognostic signature
To screen prognostic genes, we split this process into two steps. First, univariate Cox regression was used to choose survival-related genes from the leading edge analysis in GSEA. Then, the spike-and-slab lasso Cox proposed by Tang and Yi et al. [12] was used to further identify prognostic genes. The spike-and-slab mixture double-exponential prior applied in this model was the key part and was expressed as follows: It had two positive value parameters, s 1 and s 0 (s 1 >s 0 >0), which need to be preset. s 0 was chosen to be small and regarded as a "spike scale" for giving strong shrinkage on coefficient estimation, while s 1 was set to be large so that it served as a "slab scale" for giving weak shrinkage on important variables. γ j was the indicator variable that linked the scale parameters with the coefficients. The algorithm for fitting the spike-and-slab Cox model was called the expectation-maximization (EM) cyclic coordinate descent algorithm, which had a fast computing speed [12]. After filtering prognostic genes by the Bayesian hierarchical lasso Cox model, the risk score was calculated for each patient, and the formula was as follows: β i represents the corresponding coefficient of a specific gene, and the expression () indicates the expression level of the corresponding gene. Next, samples were divided into high-risk or low-risk groups according to the median risk score. Kaplan-Meier and receiver operating characteristic (ROC) curve analysis were performed to assess the predictive effect. Also, we integrated the risk score and clinical factors into one model to build the final prognostic model and evaluated the performance by the C-index in the training set and validation set. Then, forest plots, nomogram plots, and calibration plots were used to demonstrate the main result fully.

Correlation and validation analysis
To further explore possible associations, we firstly applied the Search Tool for the Retrieval of Interacting Genes (STRING, https://cn.string-db.org/) [14] to construct the Protein-Protein interaction (PPI) network based on the above genes. Then Spearman correlation analysis was performed to investigate the association at expression levels. On the other hand, the protein expression levels of the 9 genes were verified using the publicly Human Protein Altas (HPA) database (https://www.proteinatlas.org/) [15]. To better deal with the HPA data, we adopted the HPAanalyze package, a powerful tool for searching and analyzing the HPA database [16].

External validation of prognostic model
We downloaded GSE15459 as an independent external validation set, which comprised the gene expression and clinical data of 182 samples. The risk scores of the GC patients were calculated and divided into high-risk and low-risk groups based on the median value. The robustness of this model was tested by Kaplan-Meier and ROC curve analysis.

Implementation
Statistical analysis was performed based on R v4.0.2. bmlasso() was used for fitting the Bayesian hierarchical lasso Cox model, and cv.bh() was executed to select optimal s0 based on the predictive performance of the model. These functions are from the freely available R package BhGLM [17].

Identification of survival-related genes
To summarize this study more comprehensively, a schematic diagram was provided in Fig  1. GSE66229 had 400 samples (with a total of 20161 genes), including 300 cases and 100 normal samples. The GSEA results revealed that 331 genes involved in 15 pathways from KEGG and 2611 genes involved in 595 terms (biological process, molecular function, and cellular component) were filtered out with FDR<0.25 and Nom-P<0.05 (S1 Table). After removing duplicated genes, a total of 2641 genes were selected for the subsequent process. Among the above processes, pyrimidine metabolism (Nom-p = 0), spliceosome (Nomp = 0.024), RNA export from the nucleus (Nom-p = 0), and viral gene expression (Nomp = 0) from KEGG and GO based on the largest absolute normalized enrichment score (NES) were shown in Fig 2. After excluding samples with missing values (i.e., survival time<1 month, survival outcome, pathological stage, age and sex), univariate Cox proportional hazards regression analysis was used to identify survival-related genes. Basic clinical characteristics were described completely in Table 1, and the results showed that 1109 genes were significantly correlated with OS at P<0.05.

Bayesian hierarchical lasso Cox for screening final prognostic genes
The selection criterion of two parameters, s 1 and s 0 , has been sufficiently discussed in a previous study [18]. The variety of the C-indexes of the survival model was sensitive to the change in s 0 but less susceptible to s 1 . Therefore, we decided to fix s 1 at 1 according to previous research. Regarding the value of s 0 , we first used the glmNet() function from BhGLM package to simulate 10-fold cross-validation repeated 10 times to obtain the stable penalty parameter λ (i.e., s λ ) and then adjusted the value with a limited range from -0.04 to 0.06 with intervals of 0.01. Our goal was to find an optimal value that simultaneously made the C-index larger and deviance smaller. According to the results of 10-fold cross-validation with repeated 10 times, we ultimately decided to choose s 0 = s λ -0.04 as an optimal value (Table 2). At the same time, the C-index and deviance of the constructed Bayesian hierarchical lasso Cox model were 0.684 (sd = 0.004) and 1582.454 (sd = 3.734), respectively. The C-index of traditional lasso Cox regression was 0.643 (sd = 0.011), which was lower than our Bayesian hierarchical lasso Cox model. Afterwards, we chose 9 prognostic genes whose coefficients were not zero, including tumour necrosis factor receptor superfamily member 11A (TNFRSF11A), nicotinamide nucleotide adenylyltransferase 1 (NMNAT1), eukaryotic translation initiation factor 5A (EIF5A), notch receptor 3 (NOTCH3), torsin family 2 member A (TOR2A), E2F transcription factor 8 (E2F8), proteasome 20S subunit alpha 5 (PSMA5), thiopurine S-methyltransferase (TPMT), and kinesin family member 11 (KIF11) ( Table 3).

PPI, Spearman correlation and validation analysis
PPI analysis was carried out to investigate possible inter-relationships of 9 genes (Fig 3A). According to the STRING database, there existed an interaction between E2F8 and KIF11. Except for the association between NMNAT1 and EIF5A (P>0.05), other pairs were statistically significant (P<0.05) (S2 Table). Notably, the expression of NOTCH3 was negatively associated with other genes, and the strongest association came from the expression of E2F8 and

PLOS ONE
KIF11 (r = 0.61) ( Fig 3B). Further, the protein levels of these genes were also explored through the HPA database (S2 Fig). The result showed that three genes (E2F8, KIF11, TPMT) had been found to be highly expressed in GC tissue.

Prognostic analysis of the 9-gene signature
We developed a prognostic signature based on the risk score constructed by the above 9 genes, and the formula of the risk score was given as follows: Riskscore = (-0.2277) � TNFRSF11A After calculating the risk score for each patient, the median risk score (median = -0.077) was regarded as the cut-off value that stratified GC patients into low-risk and high-risk groups. Among them, the range of risk scores was [-1.569, 2.104]. The low-risk group was defined as [-1.569, -0.077), and the high-risk group was defined as [-0.077, 2.104]. The Kaplan-Meier survival analysis demonstrated a statistically significant difference between the low-risk and highrisk groups (P<0.0001, Fig 4A), and the AUC of the risk score was 0.765 ( Fig 4B). The distribution of survival status is also shown in Fig 5A. As the risk score of the GC patients increased,

Function enrichment analysis
To elucidate the possible biological terms or pathways associated with the risk score, we also utilized GSEA to perform GO and KEGG pathway analyses based on differentially expressed risk genes (DRGs) between the low-risk and high-risk groups. As shown in the chart, we intuitively discovered that DRGs were enriched mainly in particular molecular terms/pathways, such as pyrimidine metabolism and respiratory electron transport chain (Fig 6A and 6B).

Combining 9-gene signature with clinical characteristics
Next, we integrated the gene signature plus several clinical factors, including age, sex and stage, into a super prognostic model to completely predict the OS of GC patients. After integration, the C-index of the final prognostic model reached 0.75, in contrast to the model that

PLOS ONE
considered only clinical factors (C-index = 0.686) (Fig 7A). The HR of the risk score was 2.609, 95% CI: 2.017-3.370. Finally, considering the application of clinical practice, we utilized a nomogram to predict the survival probability of GC patients ( Fig 8A). Moreover, calibration plots were used to illustrate the stability of the nomogram in predicting 1-year, 3-year, or 5-year OS (Fig 8B-8D).

Independent external validation in GSE15459
To assess the robustness of the 9-gene prognostic signature, we selected GSE15459 as an independent external validation set. Similar to the training set, Kaplan-Meier analysis indicated that low-risk patients had longer survival times than high-risk patients (P = 0.0001, Fig 4C), and the overall AUC of the risk score was 0.703 ( Fig 4D). The distribution between the risk score and survival status is displayed in Fig 5B. We observed that the expression levels between different risk groups were similar to those in the training set, which further verified the accuracy of our results. After integrating clinical characteristics, the C-index of the final prognostic model also increased to 0.75, in contrast to the model with only clinical characteristics (Cindex = 0.696). Multivariate analysis demonstrated that the risk score could be a stable prognostic factor for the prediction of OS (HR = 1.815, 95% CI: 1.288-2.560, P<0.001) (Fig 7B).

Discussion
As one of the most common tumour diseases, GC has a relatively high incidence and mortality rate, especially in Asia [1]. Reliable markers could be applied as the vital indicator for GC risk stratification and following treatment. Compared to the traditional model with clinical characteristics, the prognostic model combining biomarkers would be more powerful for prediction performance, especially in the current era of precision medicine. However, the high-dimensional feature of genetic data makes it difficult to model directly, and some effective statistical methods are needed, such as PCA, ridge, and lasso [4]. In this study, we used a Bayesian approach (also known as spike-and-slab lasso Cox) to screen target genes [12]. This method integrated mainly the penalized lasso and Bayesian variable selection, which was superior to lasso. First, this Bayesian model could achieve the same function of variable selection as lasso Cox. Second, compared with lasso Cox giving the same penalty parameter to all coefficients, spike-and-slab lasso Cox could achieve more flexibility. In other words, it could selectively shrink different predictors based on different scales from the data (i.e., giving relatively small shrinkage to those predictors with large effect and giving strong shrinkage to irrelevant or weak predictors at the same time). Therefore, to some extent, this Bayesian model could reduce the estimation bias of lasso Cox. Our study also demonstrated that the cross-validated C-index of the Bayesian model was better than the traditional lasso (0.684 vs 0.643).
After choosing the optimal model, 9 prognostic genes with non-zero coefficients were selected. Next, we established the risk score index based on these genes, and Kaplan-Meier analysis demonstrated that GC patients in the high-risk group (> = -0.077) had a shorter survival time than those in the low-risk group (<-0.077). Additionally, we tested the predictive power of the 9-gene signature alone, and the results showed that it was a great predictor for OS (AUC = 0.765 in the training set; AUC = 0.703 in the validation set). Compared to a 4-gene signature reported by a previous study, our signature acquired a higher level (AUC for OS, 0.765 vs 0.684) [19]. Furthermore, we integrated the risk score with clinical factors (age, sex, pathological stage) to construct a multivariate Cox regression model. The results proved that this risk signature could be an independent prediction marker for GC. The C-index of the integrative model reached 0.75 in both the training set and validation set, which showed the reliable performance of our model. Finally, considering the possible clinical application, a nomogram was provided to visually predict the GC patients' survival. By evaluating 1-year, 3-year, and 5-year OS calibration curves, the drawn nomogram had a relatively high accuracy of prediction.
Through the Bayesian hierarchical lasso Cox model, the 9 genes we found were also the focus of other basic researches or population studies. NOTCH3, a member of the NOTCH family, is involved in the NOTCH signalling pathway, which is regarded as one of the key pathways constituting the stem cell signalling network [20]. Our results demonstrated that the increased expression of NOTCH3 was associated with poor OS among GC patients, which was consistent with previous studies [21,22]. TNFRSF11A, also regarded as the receptor activator of NF-κB (RANK), can activate several pathways, such as NF-κB, JNK, ERK, p38, and Akt/ PKB, and was reported to be a novel and frequent target for de novo methylation in gliomas [23]. Our study showed that the expression of TNFRSF11A was positively associated with survival (HR = 0.7964<1), which was consistent with another study [24]. In population-based studies, TPMT, TOR2A, KIF11, and EIF5A were reported to be associated with prognosis in other tumours, such as childhood acute lymphoblastic leukaemia [25,26], ovarian cancer [27,28], breast cancer [29], and colorectal cancer [30].
NMNAT1 is involved in the NAD+ salvage/recycling pathway, which is crucial for maintaining the functions of a wide variety of NAD+-dependent enzymes in the cytoplasm and nucleus [31]. Knockdown of NMNAT1 enhanced rRNA transcription, which might facilitate increased ribosome biogenesis and tumour development [32]. However, to our knowledge, for the NMNAT1 signature, there have been no relevant studies based on populations. Our study revealed that high expression of NMNAT1 was associated with lower mortality risk and higher OS than low expression. E2F8 is a member of the E2F family of transcription factors that regulates various cellular functions related to the cell cycle and apoptosis [33]. E2F8 is considered to be a kind of transcriptional repressor that is similar to E2F7 in that it can inhibit E2F-driven promoters [34]. A previous study found that the increased expression of E2F family members (E2F2, E2F5, E2F6, and E2F7) was significantly associated with favourable OS in GC [35]. Our research further revealed that increased expression of E2F8 was also associated with favourable prognosis.
Notably, some previously reported genes, such as PSMA5 and KIF11, can be treated as potential therapeutic targets for tumours [36,37]. Previous studies demonstrated that the upregulation of PSMA5, by activating the key Nrf2/ARE signalling pathway, played a critical role in the mechanism of inducing tumour cell apoptosis caused by combined chemotherapy regimens [38,39]. KIF11 silencing induced chromosome instability (CIN), which might contribute to cancer development and progression [40]. However, other studies showed that the suppression of PSMA5 could strengthen the sensitivity of myeloma to bortezomib [41]. Apigenin induced apoptosis in prostate cancer cells, which was accompanied by the downregulated expression of PSMA5 [42]. These studies suggested that the same gene might involve different mechanisms in different tumour types.
In general, our study used a Bayesian hierarchical lasso Cox model to screen a prognosisrelated gene signature. It was the first to apply this Bayesian approach to construct prognosisrelated models in GC. Notably, there might be some restrictions regarding generalizability to populations in other regions because this model we built was based on the Asian population. Moreover, our analysis focused mainly on mRNA expression data, but other molecular types, such as microRNA, CNV, or methylation, might contain important prognostic information in GC. Therefore, in further research, we would consider establishing a more generalized model that combines different molecular data for better prediction based on our Bayesian approach. Finally, although the 9-gene signature was explored by statistical analysis or database validation, we expect to further verify these genes by in vitro/vivo experiments in the future study.

Conclusion
Our research confirmed that the Bayesian hierarchical lasso Cox model had great prediction power than the traditional Cox model. Based on this Bayesian approach, we proposed a 9-gene prognostic signature as an independent predictor for the overall survival of GC patients. Finally, combined with clinical characteristics, a comprehensive nomogram was provided for clinical application. Overall, our study offers certain reference significance for clinical prognosis prediction in GC.